Data Collection

Introduction

Data Dictionary

price

price in US dollars ($326--$18,823)

carat

weight of the diamond (0.2--5.01)

cut

quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color

diamond colour, from D (best) to J (worst)

clarity

a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x

length in mm (0--10.74)

y

width in mm (0--58.9)

z

depth in mm (0--31.8)

depth

total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
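
As a quick check of the formula above, here is a minimal sketch that recomputes the depth percentage from x, y and z. The function name `depth_percentage` and the sample measurements are illustrative assumptions, and the factor of 100 is assumed from the stated 43--79 range.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: 100 * z / mean(x, y) = 100 * 2 * z / (x + y)."""
    if x + y == 0:
        # Rows with x = y = 0 exist in the stated ranges and cannot be evaluated.
        raise ValueError("x + y must be non-zero to compute depth")
    return 100.0 * 2.0 * z / (x + y)


# Hypothetical measurements in mm (not taken from the dataset).
print(round(depth_percentage(4.0, 4.0, 2.4), 1))  # 60.0
```

Rows where the recomputed value disagrees strongly with the recorded depth column are candidates for the error checks later in the notebook.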

table

width of top of diamond relative to widest point (43--95)

Importing All Libraries

Read the CSV file into pandas
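
A minimal sketch of this step, assuming the columns listed in the data dictionary. The file name is not given in the source, so the example parses a small in-memory string with illustrative rows; `diamonds.csv` in the comment is a hypothetical path.

```python
import io

import pandas as pd

# Illustrative rows only; the column names follow the data dictionary above.
csv_text = """carat,cut,color,clarity,depth,table,price,x,y,z
0.30,Ideal,E,SI1,61.0,56.0,500,4.30,4.32,2.63
0.50,Premium,G,VS2,60.5,58.0,1500,5.10,5.08,3.08
1.00,Good,H,SI2,63.0,57.0,4500,6.30,6.35,3.99
"""

# With a file on disk this would be pd.read_csv("diamonds.csv") (hypothetical path).
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows
```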

Take a look at the data and the first few rows

Some Visualizations and Descriptive Statistics

Data cleanup (handling missing values, duplicates, errors, outliers)

More visualizations: correlations and heatmaps

Checking for any missing values or duplicates
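
One way to run this check with pandas is sketched below. The small data frame is made up for illustration (one duplicated row, one missing value); the real notebook would apply the same calls to the full data frame.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 duplicates row 0, and row 3 has a missing carat.
df = pd.DataFrame({
    "carat": [0.30, 0.30, 0.50, np.nan],
    "cut": ["Ideal", "Ideal", "Good", "Fair"],
    "price": [400, 400, 1200, 900],
})

missing_per_column = df.isna().sum()          # count of NaN per column
n_duplicates = int(df.duplicated().sum())     # rows identical to an earlier row

# Drop exact duplicates first, then rows with any missing value.
cleaned = df.drop_duplicates().dropna().reset_index(drop=True)

print(missing_per_column)
print(n_duplicates, len(cleaned))
```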

Checking for any errors or outliers
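
The data dictionary lists a minimum of 0 mm for x, y and z, which is physically impossible, and maxima such as 58.9 mm for y that look like outliers. The sketch below shows one possible treatment: drop zero dimensions, then apply a 1.5 * IQR rule. The helper `iqr_mask`, the threshold, and the sample rows are assumptions, not the notebook's confirmed method.

```python
import pandas as pd

# Made-up rows: one zero dimension (error) and one extreme y value (outlier).
df = pd.DataFrame({
    "x": [4.0, 4.2, 0.0, 4.1, 4.3, 4.4],
    "y": [4.1, 4.2, 4.0, 4.0, 58.9, 4.5],
    "z": [2.5, 2.6, 2.4, 2.5, 2.7, 2.8],
})
dims = ["x", "y", "z"]

# A dimension of 0 mm cannot be real, so treat those rows as data errors.
no_zeros = df[(df[dims] > 0).all(axis=1)]


def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Keep rows where every dimension passes the IQR rule.
within = no_zeros[dims].apply(iqr_mask).all(axis=1)
cleaned = no_zeros[within]

print(len(df), len(no_zeros), len(cleaned))
```

The IQR rule is aggressive on skewed columns such as price, so a fixed domain threshold may be a better fit there.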

Feature creation / Feature selection (Did we change, add, or drop any features?)

Feature selection:

Label encoding the data to get rid of the object dtype.
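
The encoder used in the notebook is not shown. As one option, the sketch below uses pandas ordered categoricals instead of a generic label encoder, so that the integer codes follow the quality order given in the data dictionary (0 = worst). The constant names and sample rows are illustrative assumptions.

```python
import pandas as pd

# Orders taken from the data dictionary, listed from worst to best.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Made-up rows for illustration.
df = pd.DataFrame({
    "cut": ["Ideal", "Fair", "Premium"],
    "color": ["D", "J", "G"],
    "clarity": ["IF", "I1", "VS2"],
})

orders = {"cut": CUT_ORDER, "color": COLOR_ORDER, "clarity": CLARITY_ORDER}
encoded = df.copy()
for column, order in orders.items():
    categorical = pd.Categorical(df[column], categories=order, ordered=True)
    encoded[column] = categorical.codes  # integer position in the given order

print(encoded)
```

An alphabetical encoder would also remove the object dtype, but it would assign codes that do not reflect quality, which matters for linear models.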

Data Visualization (Is the data ready to be modeled?)

Visualize the relation between columns on a count plot

All data in a single data frame

Train/Test Split
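
A minimal sketch of the split with scikit-learn, assuming price is the target. The data is synthetic, and the 80/20 ratio and `random_state` are assumptions since the notebook's settings are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data frame.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "depth": rng.uniform(55, 70, n),
})
df["price"] = 5000 * df["carat"] + rng.normal(0, 200, n)

X = df.drop(columns="price")  # features
y = df["price"]               # target

# Hold out 20% of the rows for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```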

Train the model using training data

Linear Regression with Statsmodels

Applying Logistic Regression

Logistic Regression with Multiple Features

Decision Trees

Random Forests

'carat' and 'y' are the most important features in predicting the target
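
Feature importances like these can be read from a fitted random forest. The sketch below uses synthetic data in which price depends only on carat, so the ranking is true by construction and does not reproduce the notebook's result; the hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; only carat drives the target here.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "depth": rng.uniform(55, 70, n),
    "table": rng.uniform(50, 65, n),
})
y = 5000 * X["carat"] + rng.normal(0, 100, n)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances can be split between strongly correlated columns (carat, x, y and z all measure size), so the ranking among them should be read with caution.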

We run the linear regression for comparison with random forest:

K-means Clustering
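
A minimal sketch of clustering on carat and price, assuming two well-separated synthetic groups. The number of clusters, the manual standardization, and the data are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups: small cheap stones and large expensive stones.
rng = np.random.default_rng(0)
small = np.column_stack([rng.normal(0.4, 0.05, 50), rng.normal(800, 100, 50)])
large = np.column_stack([rng.normal(1.5, 0.10, 50), rng.normal(9000, 500, 50)])
X = np.vstack([small, large])

# Standardize so price (thousands) does not dominate carat (around 1).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(np.bincount(labels))
```

Without scaling, Euclidean distance would be driven almost entirely by price.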

SVM

Cross Validation
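
A minimal sketch of k-fold cross-validation with scikit-learn, assuming a linear regression on carat alone. The data is synthetic, and the fold count and R^2 scoring are assumptions rather than the notebook's confirmed settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: price is roughly linear in carat.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.0, 200)
X = carat.reshape(-1, 1)
y = 5000 * carat + rng.normal(0, 200, 200)

# Five shuffled folds; each fold is held out once for scoring.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)
print(scores.mean())
```

Reporting the mean and spread of the fold scores gives a more stable estimate than a single train/test split.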